Ki Data Science Resources

Author
Ryan Hafen

Preva Group

Published

July 29, 2023

1 Working With Synapse

In Ki, we use Synapse to store analysis artifacts and data. There are some tools that can make working with Synapse a bit easier.

Important

Never store individual-level data in Synapse!

1.1 synapser R package

You can use the synapser R package to interact with Synapse and do things such as download or upload files.

Here is an example of using the package to read a CSV file:

# only run if not installed:
# install.packages("synapser",
#   repos=c("http://ran.synapse.org", "http://cran.fhcrc.org"))

library(synapser)
synLogin()
entity <- synGet("syn41733315")
loinc <- readr::read_csv(entity$path)
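Uploading works similarly via synStore(). Here is a minimal sketch; the file name is hypothetical and synXXXXXXXX is a placeholder for the destination folder's Synapse ID (remember that only summary-level data should be stored on Synapse):

```r
library(synapser)
synLogin()

# store a local file in a Synapse folder
# ("results_summary.csv" and "synXXXXXXXX" are placeholders)
file <- File("results_summary.csv", parent = "synXXXXXXXX")
synStore(file)
```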

Note that this package can be difficult to install on some systems and difficult to run in some environments (for example, it cannot be used in conjunction with reticulate or radian).

1.2 Synapse Uploader and Downloader

Synapse does not support uploading or downloading entire folders of files; each file has to be uploaded or downloaded individually. In Ki, we have created tools that make batch upload and download on Synapse easy and fast. These are both Python packages that we typically use as command-line tools. Perhaps at some point we will create simple R/reticulate wrappers.

1.2.1 Synapse Uploader

To install and use synapse-uploader, you can take a look at the README on GitHub.

Suppose, for example, that you want to upload the contents of a folder on your computer to a folder on Synapse.

First find (or create) the folder on Synapse that you want to upload to and note its Synapse ID (e.g. synXXXXXXXX).

Then simply call this from your command line:

synapse-uploader synXXXXXXXX /path/to/local/folder_name

If you don’t have the proper environment variables set, it will ask for your Synapse credentials. This will upload the contents of folder_name to the folder on Synapse.

1.2.2 Synapse Downloader

The synapse-downloader works similarly.

For example, to download that same folder from Synapse to your machine, you can use the following:

synapse-downloader synXXXXXXXX /path/to/local/folder_name

More options can be found in the README.

2 Ki Service Catalog

The Ki Service Catalog is a set of compute resources on AWS managed by Sage Bionetworks. It allows you to “centrally manage commonly deployed AWS services, and helps you achieve consistent governance which meets your compliance requirements, while enabling users to quickly deploy only the approved AWS services they need.”

This provides a platform that allows us to spin up compute environments for analysis that meet security requirements for working with some of the RWD datasets such as CPRD. It is managed by Sage Bionetworks, with whom the Foundation has specific Data Use Agreements set up.

We will provide step-by-step instructions for getting started with the Service Catalog, but you can find more details from documentation provided by Sage Bionetworks here.

Important

Before you get any further with using the Service Catalog, please note that after you create and do work on a compute environment, once you terminate that instance, all your work will be lost. You will need to be sure to continually push appropriate code and results to the appropriate places (code to GitHub or Synapse, results to Synapse). Also remember that subject-level data cannot be stored on Synapse. Typically subject-level data will be inputs that you download to your compute environment, and any outputs you produce will be summary-level data that can be stored on Synapse.

Also note that running instances will incur charges to the Foundation. Feel free to keep an instance running throughout the course of an analysis, but please be sure to terminate any instances you are not using.

2.1 Setup

There are a few one-time steps that need to be done to get set up to use the Service Catalog.

First, you need to join the BMGFKI_ServiceCatalogUsers group on Synapse https://www.synapse.org/#!Team:3432633. This assumes that you already have a Synapse account. If you do not, you can create one here.

If you are going to be using the “Notebook” instance type as described in Section 2.2.1, this is all you will need to do. However, as discussed in that section, this instance type has some major limitations (see below).

For access to the “Linux Docker” instance type described in Section 2.2.2, you will need to do a few more things. Sage Bionetworks has provided instructions here that you can follow. We will augment that with some additional details here that may be helpful.

The first step is to create a Synapse personal access token. Note that the link in the Sage documentation is not correct for Ki users. You need to visit http://bmgfki.sc.sageit.org/personalaccesstoken.

Note

Anywhere in the Sage Bionetworks instructions that you see anything starting with “https://sc.sageit.org”, replace it with “https://bmgfki.sc.sageit.org”.

Next, you need to install the AWS Command Line Interface (CLI) tools, followed by the Session Manager plugin for the AWS CLI.

Finally, create a Synapse credentials file and update your AWS configuration so that you can get SSM access to your instance. See here for more information. This is the most difficult step. Please follow these instructions carefully. Note that Step 7 in those instructions is testing the SSM setup with a provisioned instance. We have not yet provisioned an instance so we will do this testing in Section 2.2.2.

An additional step for this setup is to add the following to your ~/.ssh/config file:

Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --profile service-catalog --region us-east-1 --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"

One final “one-time-only” task is to make sure you have an SSH key on your local machine. If you don’t, you can create one by opening either a terminal (on Linux or Mac) or the Command Prompt (on Windows) and typing the following:

ssh-keygen -t rsa

When prompted for a filename, you can go with the default, “id_rsa”. When asked for a passphrase, just hit enter to leave it empty. We will use this file later.

2.2 Provisioning an Instance

Navigate to the Service Catalog portal: http://bmgfki.sc.sageit.org.

Products to choose from in Service Catalog

2.2.1 EC2: Ubuntu Linux with Notebook Software

If you would like, you can choose the “EC2: Ubuntu Linux with Notebook Software” product. A benefit of this product is that it gives you an instance with very easy access to RStudio Server. However, it has a very old version of R (3.6.3) and a limited number of outdated R packages installed. It may be possible to update R, and you can update and install new R packages, but this will need to be done each time a new instance is provisioned.

To provision this product, select it and click the “Launch Product” button.

This will take you to a screen where you enter a name for the instance and choose the instance type and disk size.

There are many instance types to choose from. You can read more about them here. If you are experimenting, we recommend starting with a smaller instance type and working up. The Foundation is charged for the time that the instance is running, so if you choose a larger instance type, it will cost more.

The disk size is the amount of storage space you will have on the instance. The acceptable range is 8-2000 GB. The Foundation is also charged for disk usage (and the charges persist even when your instance is sleeping), so choose a reasonable number here based on your analysis needs. If you are working with small datasets, 50-100 GB is probably all you need.

Once you have made your selections, click “Launch Product”.

It will take about 5 minutes or so to build. You may need to hit the refresh button in the top right of the screen next to the “Actions” button to see the updated status (the status should change from “Under change” to “Available”). Once it has built, if you scroll to the bottom of the page, you will see a section called “Outputs”. This will contain the information you need to connect to RStudio Server running on your instance.

Link to connect to RStudio Server

Note

It is possible to update R on the Notebook server, but once it has been updated, you will no longer be able to produce plots inside the “plots” pane of RStudio Server. We have not been able to successfully update RStudio Server on these instances.

2.2.2 EC2: Linux Docker

The “EC2: Linux Docker” product allows us to create a compute instance on which we can run a pre-built Docker image. This gives us more control over the software environment on the instance. The downside is that it is more difficult to get access to RStudio Server or Jupyter notebooks; for now, we will need to use SSH to connect to the instance. A really nice option with SSH, however, is to use Visual Studio Code’s Remote-SSH capabilities to do your work on the instance as if you were working on your own computer. This is somewhat difficult to configure initially but well worth the effort. We will attempt to provide instructions for doing this below.

Note

We highly recommend using this environment with Visual Studio Code. This can serve as your “IDE” for working on the instance and has great support for both R and Python. The R support rivals, and in many cases exceeds, that of RStudio, and many other tools are available such as Git and GitHub integration. A major feature that makes this so easy to use with the Service Catalog is the ability to use the Remote-SSH extension to connect to the instance as if you were working on your own computer, as well as the Development Containers feature, which allows you to seamlessly use Docker containers. Here is a video that gives you a better idea of how you can use R with Visual Studio Code.

To get one of these instances running, navigate to the Service Catalog portal and choose “EC2: Linux Docker” and click “Launch Product”.

Give the instance a name.

Instance parameters

Choose the instance type and disk size. As discussed previously, the Foundation is charged based on the instance type and disk size that you choose, so be sure to assess your anticipated needs and choose accordingly.

Once your instance is provisioned, you will see a page with details about the instance.

Example of provisioned product details

You can keep track of the status of the instance by looking at the “Status” entry. You may need to refresh the page to see updated status. Your instance is ready when the status changes from “Under change” to “Available”.

Once your instance is available, you can look toward the bottom of the page to find the Instance ID.

Example of an instance ID

It should look something like i-xxxxxxxxxxxxxxxxx. We need this ID to connect to the instance over SSH. If you were able to successfully follow the instructions in Step 5 of Section 2.1, you should be able to connect to the instance with the following command (replace the instance ID with your own):

aws ssm start-session --profile service-catalog --target i-0d56c99510b0cccfe --region us-east-1

This should put you into an interactive shell on the machine. Now exit out of it by typing “exit”.

Note

This command differs from that found in the Sage Bionetworks documentation in that it explicitly specifies the region. This is necessary because my default region does not match the region where the instance was created. You may encounter the same need.

If this command does not put you into a terminal on the instance, then you will need to revisit the SSM access instructions to get this working.

Once you have verified that the SSM command runs successfully, there are a few things you need to do each time you create a new instance.

First, we want to copy our public key to this instance so that we can SSH into it. Run the following in a terminal on your machine:

key=`cat ~/.ssh/id_rsa.pub`
aws ssm start-session \
  --profile service-catalog \
  --target i-0d56c99510b0cccfe \
  --region us-east-1 \
  --document-name AWS-StartInteractiveCommand \
  --parameters command="echo $key | sudo tee -a /home/ec2-user/.ssh/authorized_keys; sudo mkdir /home/ec2-user/.devcontainer; sudo wget https://raw.githubusercontent.com/ki-tools/ki-service-catalog/main/devcontainer.json -O /home/ec2-user/.devcontainer/devcontainer.json; exit"

This command makes it possible for you to use SSH instead of SSM. It also copies settings to the instance that will make it easy to get going with a data science Docker container that we have created and stored on Docker Hub.

Now you should be able to SSH into the instance with this:

ssh -i ~/.ssh/id_rsa ec2-user@i-0d56c99510b0cccfe

2.2.3 Working with Visual Studio Code on Your Instance

We now need to get our instance working with Visual Studio Code. On your local machine, open Visual Studio Code and open the command palette with Ctrl+Shift+P (or Cmd+Shift+P on a Mac). Type “Remote-SSH” and choose “Remote-SSH: Add New SSH Host…”.

Type the ssh command from before (ssh -i ~/.ssh/id_rsa ec2-user@i-0d56c99510b0cccfe - be sure to replace the instance ID with yours) and hit enter. It will ask if it should update your ~/.ssh/config file. Say yes.
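After saying yes, the entry that ends up in your ~/.ssh/config should look something like the following (the instance ID is a placeholder; the `Host i-* mi-*` ProxyCommand entry added in Section 2.1 still applies because its pattern matches the instance ID):

```
Host i-xxxxxxxxxxxxxxxxx
    HostName i-xxxxxxxxxxxxxxxxx
    User ec2-user
    IdentityFile ~/.ssh/id_rsa
```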

Now you can connect to this host by opening the command palette and typing “Remote-SSH” and choosing “Remote-SSH: Connect to Host…”. Choose the host that you just added.

This will open a new project in Visual Studio Code. Your file explorer sidebar should look something like this:

Click “Open Folder” and choose the default option:

Now you should see a status update in the bottom left that it is setting up the connection to the server.

You may see a window pop up about trusting the files in this folder. Click “Yes, I trust the authors”.

Now in the bottom right you should see a note saying that your “Folder contains a Dev Container configuration file. Reopen folder to develop in a container”. Click “Reopen in Container”.

You should now see a status update in the bottom left that it is starting the container. This could take a few minutes as it is pulling down our Ki Data Science Docker image from Docker Hub.

On subsequent connections through a remote SSH session, you will still need to open the folder and reopen it in a container, but this will be much faster because the Docker image will not need to be pulled again.

Note

The Docker container automatically mounts your home directory on the instance. This means that you can access files on the instance from within the container and that the files will not disappear if you close your Visual Studio Code project or lose your Remote SSH session. You may lose access to the R session you are running though, so you should always follow the best practice of breaking your project into tasks and saving intermediate files for these tasks that can be easily loaded in subsequent sessions.

Once you are in the container, you are working on the remote machine but it is from the convenience of your local machine. You will be working in Visual Studio Code just as if you were working locally. You can open terminals, view plot outputs, run Shiny apps, etc. You can also use the file explorer to navigate the files on the remote machine.

2.3 Getting Data In and Out of Your Service Catalog Instance

Since your Service Catalog instance is not permanent, you will need to get data from other sources and save data to other sources. We discuss some options for doing so in this section.

Important

We cannot stress enough that any data or code that you want to keep needs to be saved to a persistent location. You also need to be vigilant about the type of data that you are saving outside of the instance. While you can work with subject-level and/or PHI/PII data on the instance, you should never save this type of data outside the instance. Instead, you should save aggregated data and the code that produced that data.

2.3.1 Synapse

If the data you need to work with is on Synapse, you can use the Synapse tools we discussed in Section 1.

2.3.2 SFTP

If you followed the steps in Section 2.1, you should be able to use SFTP or SCP to transfer files to and from your instance (both run over the same SSH ProxyCommand configuration). The following command is an example of how to transfer a file from your local machine to your instance:

scp -i ~/.ssh/id_rsa test.txt ec2-user@i-xxxxxxxxxxxxxxxxx:~/

2.3.3 SFTP GUI

If you want to use a GUI for transferring files to your instance, you need to find an SFTP client that supports “ProxyCommand”. One such client for macOS is Transmit, which is a paid product but works really well.

3 Code and GitHub

Whether working on your own machine or on the Service Catalog, we recommend using GitHub to manage your analysis code for Ki analyses. There are many nice features in GitHub that make it easy to collaborate and later share your code with others in convenient ways.

Each analysis gets its own repository in the “ki-analysis” organization: https://github.com/ki-analysis/. If you don’t have access to this organization, or don’t have permission to create a new repository for an analysis, contact Ryan Hafen.

When creating a new repository, give it a meaningful name. If it is part of a rally/sprint, place the rally number and sprint letter at the beginning of the repository name. Keep the repository set to private unless there is a good case to make it public. Also be sure not to commit data files or other sensitive information to the repository.

We highly recommend following best practices when writing code, especially if working on a team or there is a good chance it will be used by others. It may seem pedantic to strongly encourage following standards, but it goes a long way in working efficiently and effectively as a team.

For R code, guidelines include the following:

  • Follow the Tidyverse style guide (see the “Analyses” section). Some highlights:
    • Use snake_case, not CamelCase or other things like dot.case, etc.
    • Use two spaces for indentation.
    • Strive to limit your code to 80 characters per line.
    • Do not use RStudio’s default “Vertically align arguments in auto-indent” for multi-line code (see here) as it makes code incredibly hard to read and edit.
    • Use <-, not =, for assignment.
    • Use double quotes ("), not single quotes ('), for quoting text. The only exception is when the text already contains double quotes and no single quotes.
    • Prefer TRUE and FALSE over T and F.
    • In data analysis code, use comments to record important findings and analysis decisions. If you need comments to explain what your code is doing, consider rewriting your code to be clearer.
  • Use lintr in your programming environment to actively check style conformance.
  • If you find yourself repeating the same operation over and over, write a general function and use it. This will make your code more readable and will make it easier for you to make changes to the code.
  • Break your analyses into pieces with different files for different tasks whenever possible. For example, do your data preprocessing in a separate script from your analysis. If you are carrying out multiple different analyses, place the code in different appropriately-named files. If one of your analyses is very long and can logically be broken up, do so. For example, an analysis might have a lengthy process of figuring out medical codes to use. Put this in one file and save the codes out and then load those codes in your analysis script.
  • We don’t want to get too prescriptive about exactly how you should organize your files or what kind of reproducibility tools you use (such as renv, targets, Docker, etc.), but instead just encourage you to be mindful of your future self or others needing to jump back into your code. A good resource on more general considerations for making your code more reproducible can be found here.
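As a quick illustration of several of these conventions (snake_case names, two-space indentation, <- for assignment, lines under 80 characters), here is a small hypothetical snippet; the data frame growth_data and its columns are made up:

```r
library(dplyr)

# summarize weight-for-age z-scores by site
# (growth_data, site, and waz are hypothetical names)
mean_waz_by_site <- growth_data %>%
  group_by(site) %>%
  summarise(
    mean_waz = mean(waz, na.rm = TRUE),
    n = n()
  )
```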

4 Visualization Style Guide

Here are some recommendations for creating visualizations in your analyses.

  • Font size: Ensure font sizes are legible by an audience who may be looking at the figure shown in a slide deck from some distance. It is difficult to prescribe a specific guideline here as the choice of plotting framework, the size of the plot, and the size at which the plot will be shrunk or blown up can all confound a recommendation. Just be attentive and keep in mind that in our experience, more often than not, the axis labels are too small. A good rule of thumb is that the axis label font size should be as big as any text you expect the audience to read on the rest of the slide.
  • Axis labels and annotations: Label axes clearly so that the plot can stand on its own without additional explanation. Add annotations if necessary to call out interesting findings.
  • Themes:
    • A white plot background is preferred with light grid lines. When working with ggplot2, we recommend using theme_minimal() or theme_bw() for your plots. We lean toward theme_bw() for faceted plots and theme_minimal() for unfaceted plots.
    • Colors: We recommend using the “Tableau 10” color scheme for categorical data and the viridis color schemes for continuous data. The Tableau 10 palette can be found in ggthemes::tableau_color_pal("Tableau 10")(10), or via utility functions for ggplot2 such as ggthemes::scale_color_tableau().
  • Aspect ratio: Think about the physical aspect ratio (height of plot vs. width of plot) of what you are plotting. Choice of aspect ratio should not be dictated by the space you need to fill in your slides, but by what is the best way to present the data. See here for some discussion on this.
  • File format: Save your plots in high-resolution png, or use vector formats such as svg or pdf when possible. This will help ensure that the embedded plot will maintain a good level of resolution regardless of the platform or operating system others are using.
  • As-is: Do everything you can to make the resulting plot “final” and usable in a presentation as-is. Extra cropping, labels, etc. ideally should not need to be done manually after the plot has been created, as any time you need to update the plot, there will be an undocumented process of extra manual steps that will need to occur every time.
  • Reproducibility: There is a good chance that others might need to run your code to recreate plots at some point in the future. Make sure your scripts are self-contained and reproducible. If creating a plot requires an undocumented process of running code across multiple files, no one will know how to make your code work, including your future self. Make sure all data dependencies are declared in your script and are available to anyone else who might be running your code.
  • Sharing: If you need to share your code with someone else who needs to tweak or modify the plot outputs, it can be useful to save the intermediate data set that is directly used in the plot commands, so that the other person does not need to re-run all of the data preparation commands.
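To tie several of these recommendations together, here is a minimal ggplot2 sketch (the data frame growth_data and its variables are hypothetical):

```r
library(ggplot2)

p <- ggplot(growth_data, aes(age_months, waz, color = site)) +
  geom_point(alpha = 0.5) +
  # "Tableau 10" palette for categorical color
  ggthemes::scale_color_tableau("Tableau 10") +
  # white background with light grid lines
  theme_minimal(base_size = 14) +
  labs(x = "Age (months)", y = "Weight-for-age z-score", color = "Site")

# save as a high-resolution png so the plot can be used in slides as-is
ggsave("waz_by_age.png", p, width = 8, height = 5, dpi = 300)
```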